Abstract

In  Phys.  Rev.  Letters,  73:2.  5  Dec.  94,  Mantegna  et  al.  conclude  on  the  basis  of  Zipf  rank  frequency 
data  that  noncoding  DNA  sequence  regions  are  more  like  natural  languages  than  coding  regions.  We  argue 
on  the  contrary  that  an  empirical  fit  to  Zipf’s  “law”  cannot  be  used  as  a  criterion  for  similarity  to  natural 
languages. Although DNA is presumably an “organized system of signs” in Mandelbrot’s (1961) sense,
an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light
on the similarity between DNA’s “grammar” and natural language grammars, just as the observation of
exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an M-sided die and
a finite-state branching process.
In Phys. Rev. Letters, 73:2. 5 Dec. 94, Mantegna
et al. “extend the Zipf approach to analyzing linguistic
texts to the statistical study of DNA base pair sequences
and find that the noncoding regions are more similar to
natural languages than the coding sequences” (p. 3169).
Specifically, the authors analyze coding/noncoding DNA
sequences and conclude that noncoding regions show a
more Zipf-like behavior than coding regions. Asserting
that “A remarkable feature of languages is Zipf’s law”
(p. 3169), they further conclude that noncoding regions
are more similar to natural languages than coding regions
(p. 3170):

The averages for each category support the
observation that $\zeta$ is consistently larger for
the noncoding sequences, suggesting that the
noncoding sequences bear more resemblance
to a natural language than the coding sequences.

Their  result  has  received  popular  notice  in  both  Science 
(266,  p.  1320,  25  Nov.  1994)  and  Scientific  American 
(272(3),  March,  1995). 

In this note we would like to argue that the Mantegna
et al. conclusion is rather farfetched. Noncoding
DNA sequences do not show much similarity to natural
languages. Rather, as far as one can judge from the
evidence of the Mantegna et al. paper, all one can say (if
their statistical analysis is not in question, which it may
well be) is that noncoding DNA sequences and natural
languages combine discrete symbols to form strings that
obey Zipf’s law. But this is of course what we knew from
the outset. In particular:

• Any number of random processes outputting discrete
symbols can display Zipf-like behavior without
bearing any resemblance to the special generative
processes currently believed to govern sentence
formation (word sequences) in natural languages.
In this sense Zipf’s law is not peculiar to natural
languages at all, and therefore cannot be used as a
strong test for whether DNA, or anything else for
that matter, has something “in common with natural
languages.” Indeed, exactly this same point
was made at length over 30 years ago by Mandelbrot
(1961) in his familiar discussion of Zipf’s law:

Further, because statistical and grammatical
structures seem uncorrelated, in
the first approximation, one might expect
to encounter laws which are independent
of the grammar of the language
under consideration. Hence, from the
viewpoint of significance (and also of the
mathematical method) there would be
an enormous difference between: on the
one hand, the collection of data that are
unlikely to exhibit any regularity other
than the approximate stability of the
relative frequencies, when different samples
are compared [i.e., data leading to
statistical laws like Zipf’s law: our comments
pn/rcb]; and, on the other hand,
the study of laws that are valid for natural
discourse [the discovery of such laws
being the goal of linguistics pn/rcb] but
not for other organized systems of signs.

(p. 213)

As is also familiar and as we show by examples below,
it is quite easy to generate Zipf-like distributions
from very simple generative processes that are quite unlike
natural languages, e.g., tossing an M-sided die and
particular very simple finite-state branching processes.¹
In short, although DNA is presumably an “organized
system of signs” in Mandelbrot’s sense, an observation
of statistical features of the sort presented in the Mantegna
et al. paper does not shed light on the similarity
between DNA’s “grammar” and natural language grammars,
just as the observation of exact Zipf-like behavior
cannot distinguish between the underlying processes of
tossing an M-sided die and a finite-state branching process.
An empirical fit to Zipf’s law cannot be used as a
criterion for similarity to natural languages.

• Zipf’s law is given by $f r \sim C$, where $f$ is the
frequency of any word, and $r$ is its rank, with words
arranged from most frequent to least frequent. In
other words, $\ln(f) = K - \zeta \ln(r)$ (with $\zeta = 1$).
The authors find that $\zeta$ is 0.286 for coding regions,
0.386 for noncoding regions, and 0.57 for natural
languages. Without further statistical tests, it
is not unreasonable to conclude that coding
and noncoding DNA sequences are more like
each other than either is to natural languages, and
that Zipf’s law (which requires $\zeta = 1$) is violated.
What is plainly required
are the usual significance tests addressing precisely
this question, e.g., the null hypothesis that coding
$\zeta$ is the same as natural language $\zeta$. Since the
variances are clearly available, it should be possible
for the authors or others to carry out the required tests on
the original data.

• As a minor point, in fact the two measures used in
the paper, Zipf behavior and Shannon entropy,
are exactly correlated. Therefore it is not surprising
that given Zipf-like behavior for noncoding sequences,
one would also observe that noncoding regions
have lower entropy than coding regions. In
effect, there is just one, not “two similar statistical
properties” (p. 3172) that natural languages and
noncoding sequences share (if they share it at all),
namely, Zipf-like behavior (or lower entropy).

For  a  finite  number  of  “words,”  entropy  is  largest 
for  a  uniform  distribution  over  word  frequencies. 
The  more  skewed  the  word  frequencies,  the  lower 


¹Indeed, as N. Chomsky points out (p.c.), if we take a collection
of English sentences and define “words” by taking the
strings starting with, say, “e” and ending with “e”, then the
resulting, more random collection of “words” shows a better
fit to Zipf’s “law,” precisely because there are no interfering
effects from the more organized features of natural language
words. On this view, the closer fit of noncoding sequences to
a Zipf distribution actually means that noncoding DNA sequences
are more random and more unlike natural languages
than coding sequences, exactly the opposite of the conclusion that
Mantegna et al. maintain.


the entropy. For coding regions (with $\zeta = 0.286$),
the word frequencies fall off more slowly with rank
than for noncoding regions ($\zeta = 0.386$). Consequently,
coding regions will have higher entropy
and lower redundancy than noncoding regions.
Having carried out a Zipf analysis and obtained
$\zeta$, one does not need to compute a separate
entropy test. Yet the authors do so (as they recognize
implicitly in the caption of figure 3 of their
paper).
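
The dependence of entropy on the Zipf exponent can be illustrated numerically. The following Python sketch assumes an idealized rank-frequency distribution with $p_r$ proportional to $r^{-\zeta}$ over a fixed vocabulary. The vocabulary size of 4096 (the number of 6-letter words over a 4-letter alphabet) and the function names are illustrative assumptions, not values taken from the Mantegna et al. data.

```python
import math

def zipf_probabilities(num_words, zeta):
    """Idealized rank-frequency distribution: p_r proportional to r**(-zeta)."""
    weights = [r ** (-zeta) for r in range(1, num_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def shannon_entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

NUM_WORDS = 4096  # assumption: all 6-letter words over a 4-letter alphabet

for label, zeta in [("uniform", 0.0), ("coding", 0.286),
                    ("noncoding", 0.386), ("natural language", 0.57)]:
    h = shannon_entropy(zipf_probabilities(NUM_WORDS, zeta))
    print(f"{label:17s} zeta={zeta:5.3f}  entropy={h:.3f} bits")
```

With the vocabulary size held fixed, the entropy in this idealized setting decreases as $\zeta$ grows, so the exponent alone fixes the entropy ordering of the three cases.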

Putting aside these and other possibly grave statistical
fallacies, in the remainder of this note we exhibit
two random processes, one an M-sided die, the other a
finite-state grammar, that are very different from each
other yet yield exact Zipf distributions. We then review
some of the many properties of natural languages not
shared by these two processes. Consequently, even if we
accept the results of the Mantegna et al. paper, the
inference from Zipf-behavior to a similarity with natural
languages cannot be justified. As mentioned, these
points were discussed more than thirty years ago by
Mandelbrot (1961), and we conclude with some historical
remarks that underscore his results along with related,
more recent work that has also examined Zipf-behavior
in DNA sequences.

1 Zipf’s Law and Random Processes: Some Examples

To begin, let us consider two very different, simple random
processes that both generate Zipf distributions: an
M-sided die and a finite-state grammar.

Let us first recall Zipf’s “law” itself. Suppose there are
M “words” in a system. These words might be generated
in various combinations according to some underlying
process, giving rise to a corpus of sentences, or more
generally, word sequences. Since there are only M
words, each word would occur multiple times in a large
(potentially infinite) corpus. One can then rank these
words, from most frequent to least frequent. Let the
frequency of the $i$th word be $f_i$. If $f_i$ is proportional to
$1/i$, the generative process is said to obey Zipf’s law.
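
As a rough illustration of how the exponent in a rank-frequency relation $\ln(f) = K - \zeta \ln(r)$ might be estimated from a list of word frequencies, the following Python sketch fits a least-squares line to the log-log rank-frequency data. Ordinary least squares is an assumption made here for illustration (it is not necessarily the procedure of Mantegna et al.), and the function name is hypothetical.

```python
import math

def estimate_zipf_exponent(frequencies):
    """Least-squares slope of ln(frequency) against ln(rank), sign-flipped.

    Assumes at least two words and strictly positive frequencies.
    """
    ranked = sorted(frequencies, reverse=True)      # rank 1 = most frequent
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance

# Frequencies exactly proportional to 1/rank should give an exponent close to 1.
exact_zipf = [1000.0 / rank for rank in range(1, 101)]
print(estimate_zipf_exponent(exact_zipf))
```

A fitted exponent near 1 corresponds to Zipf’s law in the strict sense; a point estimate of this kind says nothing about significance without an accompanying variance.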

Example  1:  An  M-sided  die. 

Let the sequence of words be generated by throwing a
biased M-sided die. In particular, let the die be such
that the probability of the $i$th side appearing on top is
given by:

$$p_i = \frac{1/i}{\sum_{j=1}^{M} 1/j}$$

Now  consider  the  following  process: 

1.  Toss  the  biased  die. 

2. If the die shows $j$, output word $w_j$.

3.  Repeat  1. 

Clearly,  this  process  generates  a  sequence  of  words 
where  the  first  word  is  twice  as  likely  as  the  second,  three 
times  as  likely  as  the  third,  and  so  on.  The  process  thus 
follows  Zipf’s  “law”  exactly. 
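
The three steps above can be sketched in Python as follows. This is a minimal simulation under stated assumptions: the side probabilities are taken proportional to $1/i$ and normalized to sum to 1, the word names `w1`, `w2`, ... and both function names are hypothetical, and the seed is fixed only to make runs repeatable.

```python
import random
from collections import Counter
from fractions import Fraction

def die_probabilities(num_sides):
    """Exact side probabilities, proportional to 1/i and normalized to sum to 1."""
    weights = [Fraction(1, i) for i in range(1, num_sides + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def generate_words(num_sides, num_tosses, seed=0):
    """Toss the biased die repeatedly; side j emits the word 'w<j>'."""
    rng = random.Random(seed)
    probs = [float(p) for p in die_probabilities(num_sides)]
    sides = rng.choices(range(1, num_sides + 1), weights=probs, k=num_tosses)
    return [f"w{j}" for j in sides]

probs = die_probabilities(10)
print([str(probs[0] / p) for p in probs])   # exact ratios p_1 / p_i

counts = Counter(generate_words(10, 100_000))
for word, count in counts.most_common(3):
    print(word, count)
```

The exact ratios $p_1 / p_i$ equal $i$ by construction, while the simulated counts only approximate these ratios in a finite sample.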


Example 2: Finite-State Grammars

Next we consider a random process generating “sentences”
in a completely different fashion from example 1,
but still obeying Zipf’s law. Rather than deal with the
case of M words directly, we provide some intuition in
the form of an example where M = 4. Suppose there
are four words: $w_1$, $w_2$, $w_3$, and $w_4$. Sentences (word
sequences) are produced by combining words in some fashion
according to a grammar. Let us assume that the
generative process is as follows:

1.  Start  at  the  root  node  of  the  annotated  tree  of  fig.  1. 

2. At each node, choose to go down any of the connected
branches (leading to a daughter node) with
equal probability. Output the word $w_i$ if the branch
is associated with the number $i$. If the branch is associated
with $\epsilon$, output nothing (empty string).

3.  On  reaching  a  leaf  node,  stop. 

The reader will recognize that this is a finite-state
grammar. Every path starting from the root node gives
rise to a sentence. There are 4! different paths, corresponding
to 4! different leaves, giving rise to 4! possible
sentence types. Since the paths are all equally likely,
each of these sentences occurs with equal likelihood.

However, due to the way in which the tree is constructed,
many paths yield the same sentence. For example,
the two paths highlighted in the figure yield the
same sentence. The reader can check that such
a grammar generates eight different sentences with the
associated probabilities in table 1.

If a corpus of sentences is generated with the probabilities
shown in the table, then it can easily be shown
that the word $w_1$ occurs twice as often as $w_2$, three times
as often as $w_3$, and four times as often as $w_4$. In other
words, if we plot word frequencies, then they would follow
Zipf’s law.²
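
The claim can be checked by direct arithmetic. The following Python sketch assumes that the eight sentences of table 1 read $w_1$, $w_1 w_2$, $w_2 w_1$, $w_3 w_1$, $w_3 w_2 w_1$, $w_4 w_1$, $w_4 w_2 w_1$, $w_4 w_3 w_1$, paired in order with the listed probabilities (an interpretation of the printed table). Words are written as integers, and the function name is hypothetical.

```python
from fractions import Fraction

# Sentences and probabilities as read from table 1 (word w_i written as i).
TABLE_1 = {
    (1,): Fraction(1, 6),
    (1, 2): Fraction(1, 12),
    (2, 1): Fraction(1, 4),
    (3, 1): Fraction(1, 6),
    (3, 2, 1): Fraction(1, 12),
    (4, 1): Fraction(1, 12),
    (4, 2, 1): Fraction(1, 12),
    (4, 3, 1): Fraction(1, 12),
}

def expected_word_counts(sentence_probs):
    """Expected number of occurrences of each word per generated sentence."""
    counts = {}
    for sentence, prob in sentence_probs.items():
        for word in sentence:
            counts[word] = counts.get(word, Fraction(0)) + prob
    return counts

assert sum(TABLE_1.values()) == 1   # the probabilities form a distribution
counts = expected_word_counts(TABLE_1)
for word in sorted(counts):
    print(f"w{word}: {counts[word]}  ratio w1/w{word} = {counts[1] / counts[word]}")
```

Under this reading of the table, the expected counts per sentence are 1, 1/2, 1/3, and 1/4 for $w_1$ through $w_4$, which is exactly the $1/r$ relation.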

In general, if there are M words, then one could construct
a similar tree. Such a tree would have M! leaves,
each leaf giving rise to a sentence. The branches could
be numbered (as done in the case where M = 4) so that
all the M! different permutations of M words can be
generated. Now, as in the specific M = 4 case, we replace
some of the numbers by $\epsilon$, equivalent to outputting
an empty string for that branch. Let us now argue that
this replacement can be carried out and yields a grammar
that generates a Zipf distribution.

We first make the following observations to describe
what the M-tree looks like before any such replacements
have been made. There are M branches at level 1. Each
of these branches bears a label from 1 to M, and no two
branches bear the same label. There are $M(M-1)$
branches at level 2. There are an equal number of

²Note that the probability of occurrence of each word is
inversely proportional to its rank. In a finite corpus, the
frequency of occurrence need not be exactly equal to the
probability. However, the convergence of frequencies to their
underlying expectations makes it more and more likely that
frequency-rank behavior will follow Zipf’s law as the number
of sentences in the corpus increases, with convergence in the
limit as the corpus size goes to infinity.
Figure 1: A tree diagram representation of a finite state grammar.


Sentence            Prob.
$w_1$               1/6
$w_1 w_2$           1/12
$w_2 w_1$           1/4
$w_3 w_1$           1/6
$w_3 w_2 w_1$       1/12
$w_4 w_1$           1/12
$w_4 w_2 w_1$       1/12
$w_4 w_3 w_1$       1/12

Table 1: Sentences generated by the finite state grammar of
fig. 1, along with the probability with which they are
generated.

branches bearing each label from 1 to M. Consequently,
$M-1$ of the branches at level 2 are labelled $i$ for every $i$
from 1 to M. Similarly, there are $M(M-1)(M-2)$ branches
at level 3, with $(M-1)(M-2)$ being labelled $i$ for every
$i$ from 1 to M. As mentioned before, there are M!
different leaves, each giving rise to a different sentence
(assuming no labels were replaced by $\epsilon$). Each sentence
is M words long, a permutation of the M words with no
repeated word.

Next, consider how we replace the labels by empty
strings $\epsilon$. Consider all the branches labelled $j$. Each
time such a branch is traversed, the grammar outputs
the word $w_j$. Suppose we choose to replace some of the
$j$ labels by $\epsilon$, leaving only $a_1$ branches at level 1 still
labelled, $a_2$ branches at level 2, and so on. We can then
prove the following two theorems (given here without
proof):

Theorem 1 Suppose $a_1$ branches at level 1 are still labelled
and the remaining branches are labelled $\epsilon$. Similarly,
suppose $a_2$ are labelled at level 2, $a_3$ labelled at
level 3, and so on. Then a fraction $f$ of the total number
of paths through the tree yields a sentence containing
the word $w_j$, where $f$ is given by:

$$f = \frac{a_1}{M} + \frac{a_2}{M(M-1)} + \frac{a_3}{M(M-1)(M-2)} + \cdots + \frac{a_M}{M!}$$

Clearly, $0 \le a_1 \le 1$; $0 \le a_2 \le (M-1)$, and in general,
$0 \le a_i \le (M-1)!/(M-i)!$. Given these constraints on the $a_i$’s, we
can also prove the following:

Theorem 2 Any fraction that can be represented as $k/M!$,
where $k$ is an integer between 0 and M!, can be obtained by
an appropriate setting for the $a_i$’s under the constraints
of Theorem 1.
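
For small M the expression in Theorem 1 can be checked by brute force. The following Python sketch enumerates all M! paths of the unrelabelled tree for M = 4, keeps $a_k$ of the branches labelled $j$ at each level $k$, and counts the paths that still output $w_j$. It assumes the fraction takes the form $f = \sum_k a_k (M-k)!/M!$; the function names and the particular $a$-settings are illustrative choices, not taken from the original proof.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def theorem1_fraction(M, a):
    """f = a_1/M + a_2/(M(M-1)) + ... + a_M/M!, with a[k-1] in the role of a_k."""
    return sum(Fraction(a_k * factorial(M - k), factorial(M))
               for k, a_k in enumerate(a, start=1))

def brute_force_fraction(M, j, a):
    """Fraction of root-to-leaf paths that still output word j after relabelling."""
    paths = list(permutations(range(1, M + 1)))   # one path per leaf, M! in all
    kept = set()
    for k, a_k in enumerate(a, start=1):
        # A level-k branch is identified by the length-k prefix that reaches it;
        # its label is the last element of that prefix.
        labelled_j = sorted({p[:k] for p in paths if p[k - 1] == j})
        assert 0 <= a_k <= len(labelled_j), "a_k exceeds the number of j-branches"
        kept.update(labelled_j[:a_k])             # arbitrary choice of which to keep
    hits = sum(any(p[:k] in kept for k in range(1, M + 1)) for p in paths)
    return Fraction(hits, len(paths))

M = 4
settings = {2: (1, 3, 0, 0), 3: (1, 1, 0, 0), 4: (1, 0, 0, 0)}
for j, a in settings.items():
    f = theorem1_fraction(M, a)
    assert f == brute_force_fraction(M, j, a)
    print(f"w{j}: a = {a}, fraction of paths = {f}")
```

The settings chosen here give fractions 1/2, 1/3, and 1/4 for $w_2$, $w_3$, and $w_4$; leaving every branch labelled 1 intact gives fraction 1 for $w_1$.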

A consequence of these theorems taken together is
that one can generate sentences in such a way that in a
corpus the word $w_j$ can be made to occur in only a fraction
$f = k/M!$ of the sentences. In particular, by choosing
$k$ appropriately ($k = M!/j$, an integer since $j \le M$),
we can make the $j$th word, $w_j$, occur
with frequency $1/j$ in the text, thus following Zipf’s law
exactly.

2  General  Remarks  and  History 

2.1  Some  Observations  on  the  Structure  of 
Natural  Languages 

It is well known that natural languages possess many
other special properties that are not tested by the Zipf-law
behavior. In particular, while finite-state grammars
can obey Zipf’s law, it has long been known that they do
not capture most of the striking properties of natural
languages:

1. Finite-state grammars by algebraic definition cannot
express hierarchical relationships, the acknowledged
hallmark of natural languages. Recall that
finite-state grammars are algebraically associative
concatenative systems (see, e.g., Harrison, 1978);
that is, if $\mathcal{L}$ is a finite-state grammar, then
$\forall a, b, c \in \mathcal{L}$, $a \cdot bc \in \mathcal{L}$ iff
$ab \cdot c \in \mathcal{L}$, where $\cdot$ is the concatenation
operator. Such a system cannot even express
the fact that one and the same linear string
of words, such as “the deep blue sky,” can have
at least two structural (hierarchical) bracketings:
(the (deep blue) sky) and (the deep (blue sky)). In
other words, finite-state grammars can express only
linear precedence relations, not hierarchical relations.
(This demonstrates a failure of what Chomsky,
1956, called “strong generative capacity.”)

2. Finite-state grammars, unlike natural language
grammars, cannot generate arbitrarily deep center-embedded
languages (see Chomsky 1956, 1986, and
many other conventional sources).


3. Under the currently best working assumptions,
natural language grammars contain very specific
constraint statements with proprietary theoretical
vocabularies unlikely to be duplicated in DNA
“grammar” (e.g., one component, so-called “trace
theory,” is stated in terms of hierarchical structural
sentence properties and noun phrases, both
not shared by DNA, as far as is known).³

2.2  Previous  work  on  Zipf’s  Law  and  on  DNA 
word  frequencies 

Both Zipf’s law and its application to DNA sequences
have a long history. We mention only a few of the
relevant points here. In the 1950s, as summarized in
Mandelbrot (1961), Mandelbrot, Simon (1955),
and Miller and Newman (1958), among others, explored
the nature of the word-frequency relationship embodied
in Zipf’s law. In particular, Mandelbrot showed how
Markovian models of discourse (subsets of finite-state
models) can give rise to Zipf-like behavior. Mandelbrot is
careful to note the well-known inadequacy of such finite-state
models to describe linguistic rules. For example, he
writes (p. 191) “the ‘finite-state’ model appears as rather
shocking because of the well known existence of some
long-range influences in discourse, such as those studied
by grammar”. He advocates ways out of this difficulty
while “acknowledging that the ‘degree of validity’ of the
finite state model decreases as the ‘wealth’ of grammars
increases.” Mandelbrot also uses various information-theoretical
arguments to suggest that Zipf’s law is not
peculiar to language, but extends to any coding scheme
with a finite number of symbols, and therefore can tell
us relatively little about any coding scheme like DNA.

As it turns out, there have also been many word-frequency
analyses of DNA sequences. As Pevzner et
al. (1989) point out, “Mathematical models of the generation
of genetic texts appeared simultaneously with
the first sequencing [of sic pn/rcb] DNA”. Pevzner et
al. (1989) actually address the key question of variance
and significant differences explicitly, proposing formulae
for the variance of the number of word occurrences in texts,
making it possible to assess the significance of deviations
from expected statistical characteristics. One can therefore
carry out the significance tests suggested earlier in
this note.

3  Conclusions 

We have argued that an observation of Zipf-like behavior
provides very little information about the nature of
the underlying process generating such frequency data.
This is simply because the underlying generative processes
could be as diverse as M-sided dice, simple finite-state
grammars, DNA sequences, and natural languages.
Inferring that noncoding DNA sequence grammars are
like natural language grammars solely on the basis of
Zipf-behavior is at best premature, and indeed at worst
is likely to be completely misleading and false.

³We should point out that some researchers, e.g., Searls,
1993, maintain the contrary position and argue that natural
language and DNA grammars share at least some generative
processes. A discussion of this point is beyond the scope of
this note.